Inherent size constraints on prokaryote gene networks due to "accelerating" growth 
Networks exhibiting "accelerating" growth have total link numbers growing faster than linearly
with network size and can exhibit transitions from stationary to nonstationary statistics and from 
random to scale-free to regular statistics at particular critical network sizes. However, if for any 
reason the network cannot tolerate such gross structural changes then accelerating networks are 
constrained to have sizes below some critical value. This is of interest as the regulatory gene net- 
works of single celled prokaryotes are characterized by an accelerating quadratic growth and are 
size constrained to be less than about 10,000 genes encoded in DNA sequence of less than about 
10 megabases. This paper presents a probabilistic accelerating network model for prokaryotic gene 
regulation which closely matches observed statistics by employing two classes of network nodes (reg- 
ulatory and non-regulatory) and directed links whose inbound heads are exponentially distributed 
over all nodes and whose outbound tails are preferentially attached to regulatory nodes and de- 
scribed by a scale free distribution. This model explains the observed quadratic growth in regulator 
number with gene number and predicts an upper prokaryote size limit closely approximating the 
observed value. 
I. INTRODUCTION

The rapidly expanding field of network analysis, re- 
viewed in P, Q , has provided examples of networks ex- 
hibiting "accelerating" network growth, where link num- 
ber grows faster than linearly with network size. For in- 
stance, the Internet |2J appears to grow by adding links 
more quickly than sites though the relative change over 
time is small and the Internet appears to remain scale free 
and well characterized by stationary statistics Q. Simi- 
larly, the number of links per substrate in the metabolic 
networks of organisms appears to increase linearly with 
substrate number Q, the average number of links per 
scientist in collaboration networks increases linearly over 
time 0, S B 1^ J^J ; E^nd languages appear to evolve via 
accelerated growth |T3| . 

In the main, the chief focus of these studies has been 
on locating parameter regimes allowing accelerating net- 
works to maintain scale free statistics and thereby to al- 
low continued unconstrained growth. For example, an 
early study considered a growing network receiving $N^{\alpha}$
new links for $\alpha > 0$ when the network size is at $N$ nodes,
but restricted analysis to the case $\alpha < 1$ as "Obviously,
$\alpha$ cannot exceed 1 (the total number of links has to be
smaller than [$N^2/2$] since one may forbid multiple links)."
and "The density of connections in real networks remains
rather low all the time, so one may reasonably assume
that $\alpha$ is small." Equivalent limits were considered
in Ref. [13]. In such restricted parameter regimes
networks could maintain scale free statistics, though this
result carries the implicit but unexamined finding that 
alternate parameter regimes permit transitions from sta- 
tionary to nonstationary statistics. This paper builds on 
these implicit findings.

Accelerating networks are more prevalent and
important in society and in biology than is commonly
realized — see the survey in Ref. hj. In fact, any net- 
work that requires functional integration and organiza- 
tion (where the activity of any given node is dependent 
on the state of the network or different subnetworks) is 
by definition an accelerating network, that is, as the net- 
work expands, the proportion of the network devoted to 
control and regulation expands disproportionately. This 
in turn means that all such networks, sooner or later, 
must be limited in their size and complexity, which limi- 
tations can only be breached by changing either the phys- 
ical nature of the control architecture (a state transition) 
or by reducing the functional integration. In the latter 
case, where networks are hitting a complexity limit, fur- 
ther growth in network size will likely display structural 
transitions from randomly connected, to scale free statis- 
tics, to densely connected and perhaps finally to fully 
connected statistics. Should such networks be unable 
to successfully complete these transitions for any reason, 
then it is likely that network growth must cease entirely 
or that either a transition to a nonaccelerating structure 
is required to permit further growth or novel technolo- 
gies must appear allowing the continuation of accelerated 
growth. Exemplar accelerating networks displaying such 
size limits or structural transitions include (a) all forms
of economic markets where the latest price offered by any
participant instantly affects all other participants, (b)
industrial companies and sectors implementing a Just-In-Time
business model where any worker can halt the
entire production system, (c) error propagation networks
linking an error source with all affected nodes as studied
in software analysis and in models of the propagation
of diseases, bushfires, cracks, and electricity grid failures,
(d) any dynamical system dependent on relative
quantities, so that changes in one node instantly affect
every other node, such as relative transcription factor
binding probabilities or relative evolutionary fitness, (e)
computer hardware and cluster and grid supercomputer
networks, and (f) organizational networks [14].
In fact, it is well understood that social networks only
take on small world statistics when the network is large
enough: in small towns everyone knows everyone
else so social networks are accelerating, and social
networks make a transition to small world statistics only as
individual nodes saturate their connectivity limits [15].
Similar observations can be made about the scale free
Internet and World Wide Web: when sufficiently small,
these networks were likely accelerating until connectivity
capacities saturated, forcing a transition to scale free
structures to permit further growth.

This paper develops an accelerating network model of
prokaryotic (single celled) gene regulatory networks to
investigate size and complexity limits inherent in the
adoption of an accelerating architecture. Because our focus is
on structural transitions, we explicitly do not need to
restrict the degree of acceleration to low values of $\alpha \approx 0$.
Rather, we permit this parameter to take on any value
including $\alpha > 1$ and ensure that the network is not
saturated by making link formation probabilistic. The
resulting novel "probabilistic" accelerating networks grow
by adding on average $pN^{\alpha}$ new links with $\alpha > 0$ and
otherwise arbitrary, provided the probability of adding a link
is suitably constrained, $p \ll 1$, so that total link number
remains less than of order $N^2$.
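
As a minimal numerical sketch of this growth rule (not part of the original analysis), the Python fragment below sums the expected number of links when the node arriving at network size $k$ contributes $p k^{\alpha}$ links on average, and compares the total with the $N^2$ ceiling on directed links. The function names and the parameter values in the usage lines are illustrative assumptions.

```python
def expected_links(n_nodes, p, alpha):
    """Expected total links when the k-th node adds p * k**alpha links on average."""
    return sum(p * k**alpha for k in range(1, n_nodes + 1))


def saturation_ceiling(n_nodes):
    """Order-N^2 ceiling: at most one directed link per ordered pair of nodes."""
    return n_nodes * n_nodes


if __name__ == "__main__":
    # Illustrative values only: p << 1 keeps even alpha > 1 far below saturation.
    for alpha in (0.5, 1.0, 1.5):
        total = expected_links(1000, 1e-3, alpha)
        print(alpha, round(total, 1), total < saturation_ceiling(1000))
```

For $\alpha = 1$ the sum reduces to $pN(N+1)/2$, which is the quadratic growth in total link number used throughout the rest of the paper.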

The gene regulatory model presented here is motivated
by comparative genomics findings that the total number
of regulatory proteins controlling gene expression (links)
scales quadratically with the number of genes or operons
(nodes) in prokaryotes [16, 17]. This quadratic growth
results as the number of links made by a regulator exploiting
homology dependent (sequence specific) interactions
scales proportionally to the number of randomly drifting
promoter sequences or effectively, with gene number [17].
Hence, gene regulatory networks are inherently
accelerating: the probable number of links per regulator
$pN^{\alpha}$ increases linearly with node number with $\alpha = 1$, so
consequently, the total number of links scales quadratically
as $pN^{\alpha+1}$. In small and sparsely connected networks,
most links come from different regulators, suggesting
that regulator number also scales quadratically
with gene number, $pN^{\alpha+1}$. Such an accelerating network
would be characterized initially by sparse connectivity
at low gene numbers and subsequently by denser
connectivity at high gene numbers as networks attempt
a transition to a densely connected regime. If the evolving
networks can successfully make this transition, the
evolutionary record will display a transition in network
statistics for some critical network size $N_c$. Conversely,
if these networks, optimized by evolution in the sparse
regime, are unable to make the transition to the densely
connected regime, the evolutionary record would show
a strict size limit $N < N_c$ at some critical network size.
But this is exactly what is observed. All prokaryotic gene
numbers and genomes are indeed of restricted size (less
than about 10,000 genes with genomes of between 0.5 and 
10 megabases ^3), in contrast to the genomes of multi- 
cellular eukaryotes (with for humans, about 30,000 genes 
and a genome of about 3 gigabases JOJ). Ref. 
[17] predicted the size limit $N_c < 20{,}000$ genes as continued
genome growth requires the number of new regulators to
exceed the number of nonregulatory nodes.

A satisfactory model of prokaryotic gene regulatory 
networks requires some novel features. As mentioned 
above, we introduce probabilistic link formation to al- 
low rapid accelerated growth and correspondingly stricter 
size limits. (A different but related mechanism was in- 
troduced in Refs. ^l^^M which considered the effects of 
stochastic fluctuations in the number of added links with 
each additional node.) In addition, we employ directed 
links and partition nodes into two classes where "regu- 
lators" can source outbound regulatory links to regulate 
other nodes (both regulators and non-regulators), while
"non-regulators" cannot source outbound links. (Ref.
[23] has previously considered networks of distinguishable
nodes.) Further, experimental evidence presented
below indicates that the distribution of inbound links is 
compact and exponential while the distribution of out- 
bound links is long-tailed and likely scale-free. As a re- 
sult, the heads and tails of our directed links are placed 
according to two distinct distributions. Altogether, these
features allow the reproduction of the observed features
of prokaryote gene regulatory networks and satisfactorily
predict the maximum prokaryotic gene count.

Our approach reproducing accelerating network statis- 
tics for growing prokaryote genomes complements and in- 
forms alternate networking approaches seeking to deduce 
or simulate the regulatory networks of particular
organisms from gene perturbation and microarray experiments

mmiiai. 

In Section II we canvass the available literature to
characterize the statistics of prokaryote gene regulatory
networks. This then allows the construction of accelerating
growth network models in Section III where we use
the continuous approximation and simulations to analyze
network statistics. The size constraints inherent in
accelerating prokaryote regulatory networks are modelled in
Section IV.



II. OVERVIEW OF PROKARYOTE GENE 
NETWORKS 

Ongoing genome projects are now providing sufficient
data to usefully constrain analysis of the gene regulatory
networks of the simpler organisms. Ref. [16] first noted
the essentially quadratic growth in the class of
transcriptional regulators ($R$) with the number of genes ($N_g$) in
bacteria with the observed results

$$R \sim \begin{cases}
N_g^{\,\cdots}, & \text{transcriptional regulation} \\
N_g^{2.07 \pm 0.21}, & \text{two component systems} \\
N_g^{2.03 \pm 0.13}, & \text{transcriptional regulation} \\
N_g^{\,\cdots}, & \text{transcriptional regulation.}
\end{cases} \qquad (1)$$

Here, the top two lines refer to different classes of
regulators while the bottom two lines are the results of a
cross-checking analysis of two alternate databases. Quoted
intervals reflect 99% confidence limits. The explanation
for this quadratic growth was that each additional
transcription factor doubles the number of available
dynamical states which, it was posited, allows for a doubling
in the fixation probabilities for this class of genes.

As noted above, Ref. [17] provides an alternate
theoretical analysis predicting quadratic growth in any
regulatory network exploiting homology dependent interactions
and analyzed 89 bacterial and archaeal genomes to
determine the relations

$$R = \begin{cases}
a N_g^{b} = (1.6 \pm 0.8) \times 10^{-5}\, N_g^{1.96 \pm 0.15} & (r^2 = 0.88) \\
p N_g^{2} = (1.10 \pm 0.06) \times 10^{-5}\, N_g^{2} & (r^2 = \ldots) \\
c N_g = (0.055 \pm 0.004)\, N_g & (r^2 = 0.75).
\end{cases} \qquad (2)$$

In this paper, accelerating networks will be based on the
quadratic second line (while nonaccelerating models
presented in later work will work with the linear third line
[27]). In all cases, the limits reflect 95% confidence levels.
For completeness, the data is shown in Fig. 1. The
observed quadratic growth implies an ever growing
regulatory overhead so there will eventually come a point
where continued genome growth requires the number of
new regulators to exceed the number of nonregulatory
nodes, and based on this, Ref. [17] predicted an upper
size limit of about 20,000 genes, within a factor of two of
the observed ceiling.
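
The order of magnitude of this ceiling can be checked with a short Python sketch. It assumes one particular reading of the criterion, namely that the limit is reached when each newly added gene is, on average, more likely to be a regulator than not, i.e. when $dR/dN_g = 2pN_g$ reaches 1/2. That threshold is our assumption for illustration and is not a statement of the derivation in the cited reference; only the coefficient $p$ is taken from the quadratic fit of Eq. (2).

```python
P_QUADRATIC = 1.10e-5  # coefficient p of the quadratic fit R = p * Ng**2, Eq. (2)


def regulators(n_genes, p=P_QUADRATIC):
    """Regulator number under the quadratic fit."""
    return p * n_genes**2


def marginal_regulator_fraction(n_genes, p=P_QUADRATIC):
    """dR/dNg = 2 p Ng: expected regulators gained per added gene."""
    return 2.0 * p * n_genes


def critical_gene_number(p=P_QUADRATIC, threshold=0.5):
    """Gene number at which dR/dNg reaches `threshold` (assumed criterion)."""
    return threshold / (2.0 * p)


if __name__ == "__main__":
    print(round(critical_gene_number()))  # same order as the ~20,000 gene limit
```

Under this assumed threshold the sketch returns a value slightly above 20,000 genes; a different threshold rescales the estimate linearly.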

Earlier surveys of bacterial genomes noted that larger 
genomes harboured more transcription factors per gene 
than smaller ones with this trend attributed to the 
need in larger genomes for a more complex network of 
regulatory proteins to achieve coordinated expression of 
a larger set of cellular functions, and to selection in com- 
plex environments leading to enrichment in transcription 
factors allowing regulation of gene expression and signal 
integration. A similar upward trend in the proportion 
of regulators as a fraction of genome size with increasing 
genome size was observed in Ref. attributed to a 

need for an increasing responsiveness in diverse environ- 
ments, with confirming observations in Ref. |3n| |. 

Prokaryotes typically group their DNA encoded genes
in operons, co-regulated functional modules of average
size 1.70 genes each in E. coli, which value we treat as
typical though in reality, operon size decreases slightly
with genome size [31]. Each operon can be either
unregulated and so constitutively or stochastically expressed or
subject to combinatoric regulation by multiple regulatory
protein transcription factors binding to each operon's
promoter sequence.

FIG. 1: Double-logarithmic plot of regulatory protein number
($R$) against total gene number ($N_g$) for bacteria (circles)
and archaea (triangles), adapted from Ref. [17]. The log-log
distribution is well described by a straight line with slope
$1.96 \pm 0.15$ ($r^2 = 0.88$, 95% confidence interval indicated),
corresponding to a quadratic relationship between regulator
number and genome size. The inset shows the same data before
log-transformation. Dashed lines show the best linear fit
to the data $R = (0.055 \pm 0.004) N_g$ ($r^2 = 0.75$).

Again assuming that E. coli is typical, any given regu- 
latory protein affects an average of about 5 operons with 
this distribution being long tailed so the majority 
of regulators affect only one operon though some regula- 
tors (CRP) can affect up to 71 operons or 133 genes |33l |. 
(This latter reference estimated that each regulator con- 
trols on average 3 genes.) More recent estimates have the 
transcription factor CRP, a global sensor of food levels in 
the environment, regulating up to 197 genes directly and 
a further 113 genes indirectly via 18 other transcription 
factors [s^. (To observe the long tailed distribution, see 
Fig. 2 of Ref. [13 and Fig. 4 of Ref. 34].) 

However, the number of inputs taken by an operon 
is characterized by a compact exponential distribution 
with a rapidly decaying tail so the majority of regulated 
operons are controlled by a single regulator while very 
few regulated operons are controlled by four, five, six or 
seven regulators jS^i, iSSj, iSJi] . In particular, Ref. [33 ex- 
amined 500 regulatory links from about 100 regulators 
to almost 300 operons to estimate that each regulated 
operon takes on average 2 inputs though Fig. 2 of this 
reference suggests an average input number of about 1.5. 
Similarly, Ref. js^ suggests that 424 regulated operons 
receive 577 links giving an average input number of 1.4, 
while Ref. '33| estimates that 327 regulated operons re- 
ceive 524 links giving an average input number of 1.6. 






III. ACCELERATING PROKARYOTE 
NETWORK MODELS 

We extend the gene network model of Ref. [34] to
construct an accelerating network model of prokaryote
regulatory gene networks. Prokaryotes typically pack their
$N_g$ genes into a lesser number of $N = N_g/g_o$ co-regulated
operons where we assume that operons contain exactly
$g_o = 1.70$ genes. Of the existing operons, $O_r$ are
regulated operons and $O_u = N - O_r$ are unregulated operons.
Of the total number of operons, there are $R$ regulatory
operons whose regulatory interactions are directed links
from regulatory operons to regulated operons. Under the
assumption that there is only one regulatory gene per
regulatory operon, the observed quadratic relation of Eq. (2)
becomes

$$R = p g_o^2 N^2. \qquad (3)$$

When regulators and regulatory links are very rare, i.e.
when genomes are small, it is likely that every new link
is associated with a new regulator so the number of links
varies roughly quadratically with operon number. We
write

$$L = l N^2, \qquad (4)$$

where $l$ denotes the probability of forming a particular
beneficial link per operon. The value for $l$ will be
approximately $p g_o^2$, but the exact relation must be derived from
the details of the implemented model.
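
The change of variables from genes to operons can be illustrated with a few lines of Python. This is a sketch only: the constants are the fitted $p$ of Eq. (2) and the assumed operon size $g_o = 1.70$, while the function names and the E. coli operon count of 2528 (quoted later in the paper) are used purely as an example.

```python
P = 1.10e-5   # quadratic coefficient p from Eq. (2)
G_O = 1.70    # assumed genes per operon


def operons_from_genes(n_genes, g_o=G_O):
    """N = Ng / g_o."""
    return n_genes / g_o


def regulators_from_operons(n_operons, p=P, g_o=G_O):
    """Eq. (3): R = p * g_o**2 * N**2."""
    return p * g_o**2 * n_operons**2


if __name__ == "__main__":
    # Example: an E. coli sized network of 2528 operons.
    print(round(regulators_from_operons(2528), 1))
```

Writing $R$ in terms of operons changes only the prefactor, so the quadratic form of the observed relation is preserved.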

Each regulatory link between nodes is directed, and
characterized by two distinct distributions describing
respectively the placement of the heads and tails of each
link. Only relatively few nodes are regulatory, and of
these, the number of outbound link tails per regulatory
node is described by a size dependent long-tailed
distribution with average about $\langle t \rangle \approx 5$. Such a long-tailed
distribution requires that link tails be preferentially
attached to an existing regulatory operon or equivalently,
the associated regulated operon must possess one promoter
binding site (among others) that binds that particular
regulator. Consequently, the preferential selection of
regulators means that the promoter sequences of newly
regulated nodes cannot be randomly chosen, since randomly
drifting promoter sequences would be as likely to match any
one regulator as another. A plausible physical explanation
for the preferential attachment of link tails to existing
regulators is that newly fixated operons come largely
from gene duplication events [35] where some of the
duplicated promoter binding sites are under strong selective
constraint while other binding sites and the operon
genes can drift freely. Gene duplication then implies that
in a genome of size $N$ operons, if some regulator $n_j$ has
$t_{jN}$ outbound regulatory links to approximately $t_{jN}$
regulated operons, then the probability that a newly fixated
operon is also regulated by $n_j$ is simply the proportion
of such regulated operons in the genome, or $t_{jN}/N$. This
implements the required preferential attachment as the
resulting rate of growth in the number of links attached
to node $n_j$ is also then proportional to $t_{jN}$. If there is also
some probability of the appearance of novel promoter
sequences, these combined processes suffice to produce the
observed scale free distributions. This model is roughly
consistent with recent estimates of the relative contributions
to prokaryote genome growth which suggest that
horizontal gene transfer rates $\gamma_h$ are roughly one third of
gene loss rates, $\gamma_h = \gamma_l/3$, and roughly one half of vertical
inheritance or gene genesis rates, $\gamma_h = \gamma_v/2$, leading
to roughly constant sized genomes over long times (as
$\dot{N} \propto \gamma_h + \gamma_v - \gamma_l \approx 0$), while "it is remarkable that
phylogenetic distributions of at least 60% [and up to 75%]
of protein families can be explained merely by vertical
inheritance" [36]. Similarly, three quarters of examined
transcription factors in Ref. pM| were two-domain pro- 
teins with shared domain architectures leading to the 
estimate that about 75% of transcription factors have 
arisen as a consequence of gene duplication (though the 
joint duplication of regulatory regions and of regulated 
genes or of transcription factors together with regulated 
genes is more rare). A further implication of these gene 
duplication processes is that, in the main, regulators can 
only appear on entry to the genome — a potential regu- 
lator lacking any target matches in a given genome will 
never form any links when most operons arise from pro- 
motor preserving duplication events. This allows us to 
considerably simplify our model, and hereafter, we only 
allow regulators to appear on their entry to the genome. 
Of course, more realistic but considerably more compli- 
cated models are possible. 
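
The duplication argument above can be sketched in Python as follows. The data structure (a list holding, for each operon, the set of regulators bound at its promoter) and the function names are hypothetical choices made for illustration; the sketch also assumes that the parent operon is chosen uniformly at random and that all of its binding sites are copied, which is a simplification of the process described in the text.

```python
import random


def duplicate_operon(regulator_sets, rng):
    """Add one operon by copying a uniformly chosen existing operon,
    including the set of regulators bound at its promoter."""
    parent = rng.randrange(len(regulator_sets))
    regulator_sets.append(set(regulator_sets[parent]))
    return parent


def out_degree(regulator_sets, regulator):
    """t_j: number of operons currently regulated by `regulator`."""
    return sum(1 for regs in regulator_sets if regulator in regs)


def inherit_probability(regulator_sets, regulator):
    """Chance that the next duplicate is regulated by `regulator`: t_j / N."""
    return out_degree(regulator_sets, regulator) / len(regulator_sets)


if __name__ == "__main__":
    genome = [{0}, {0}, set(), {1}]        # toy genome of four operons
    print(inherit_probability(genome, 0))  # regulator 0 controls half of them
    duplicate_operon(genome, random.Random(0))
    print(len(genome))
```

Because the chance of gaining a link is proportional to the links a regulator already has, well connected regulators grow fastest, which is the preferential attachment invoked for the long-tailed outbound distribution.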

In contrast to the relatively small number of regulatory 
nodes, all nodes can themselves be regulated by inbound 
links and in fact, can be multiply regulated as promotor 
regions can contain more than one binding site. Further, 
the many used and unused promotor region binding sites 
broadly sample the space of possible binding sites so only 
a small fraction of nodes will be regulated by any one 
regulator. As a result, the number of inbound link heads
per node is described by a size dependent exponential
distribution with a low average of $\langle h \rangle \approx 1.5$ as typically
results from the random or non-preferential attachment
of inbound links to operon promoter sequences.

We suppose that the operon network grows by the
sequential addition of numbered nodes $n_k$ for $1 \le k \le N$,
and that at network size $k$, node $n_i$ ($1 \le i \le k$) has $t_{ik}$
outbound tails and $h_{ik}$ inbound heads. We do not model
the many trials of potential genes over many generations
and merely include fixated genes in our count; that is,
drifting sequence is not counted as part of the fixated
genome. This further implies the sequence of established
nodes is under severe selective constraint and unable to
drift so consequently new links cannot be added between
existing nodes. (If a proportion $fN$ of existing nodes can
explore novel sequence space in time $dt$, then the number
of new regulators increases as $dR \propto fN^2\,dt$, and as
$N$ is itself a function of time, this integrates to generate
a non-quadratic relation between regulator and operon
number which is not observed.)

FIG. 2: An example statistically generated E. coli genome
using the later results of this paper where (for convenience
only) operon nodes numbered $n_1, \ldots, n_N$ are placed
sequentially counterclockwise on a circle in their historical order of
entry into the genome. The filled points on the outer circle
locate regulators and have radius indicating the number of
outbound regulatory links. The open points on the middle circle
locate regulated operons and have radius indicating the number
of inbound regulatory inputs. The arrows in the inner circle
show all directed regulatory links.

For clarity, Fig. 2 preempts later calculations and
depicts a statistically generated version of an E. coli genome
where nodes are placed sequentially counterclockwise in a
circle (for convenience only). Alternative genome models
may be distinguished by the age distribution of regulators,
regulated operons and their link numbers, and these
are indicated in this figure. In particular, Fig. 2 shows
a highly nonuniform distribution of regulators and
outbound link numbers with gene age in contrast to a
uniform distribution of regulated operons and of inbound
link numbers. (It will turn out that these latter
age-independent distributions are only present when regulator
number grows quadratically with genome size.)

These distributions result from the physical processes 
underlying the formation of regulatory links in prokary- 
otes. As discussed above, a substantial proportion of the 
gene regulation network of prokaryotes is enacted via ho- 
mology dependent interactions as when sequence spec- 
ified protein transcription factors bind to specific pro- 
moter sequences. The undirected nature of evolution- 
ary searches means that gene regulatory networks funda- 
mentally exploit the same sequence matching algorithms 
used in comparative genetics where the probability of ob- 
taining matches between a single given trial sequence of 



some small fixed length and an entire genome scales pro- 
portionately to genome length — doubling genome length 
doubles the probability of a match. Hence, the expected 
number of links formed per regulator scales linearly with 
present genome size. As the number of source trial se- 
quences also scales with genome length, the expected 
number of matches between all regulators and all regu- 
lated operons scales quadratically with genome length, or 
effectively, with operon number assuming constant sized 
operons over the evolutionary record. 
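
The linear scaling of matches with genome length can be made concrete with a small Python sketch. It assumes an idealized genome of independent, uniformly distributed bases over a four letter alphabet, which is a simplification introduced here for illustration and not a claim about real promoter statistics; the function names are likewise hypothetical.

```python
def match_probability(site_length, alphabet=4):
    """Chance that one random position matches a fixed site of given length."""
    return alphabet ** (-site_length)


def expected_matches(genome_length, site_length, alphabet=4):
    """Expected matches of one fixed site over all start positions."""
    positions = max(genome_length - site_length + 1, 0)
    return positions * match_probability(site_length, alphabet)


if __name__ == "__main__":
    one = expected_matches(1_000_000, 7)
    two = expected_matches(2_000_000, 7)
    print(two / one)  # close to 2: doubling the genome doubles the matches
```

If the number of trial sequences also grows in proportion to genome length, multiplying the two factors gives the quadratic scaling of total matches described above.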

As a consequence, on entry into the genome, each new
gene has some probability of being a regulator dependent
firstly on its suitability to bind DNA and secondly on the
linearly increasing expected number of acceptable binding
targets present in the genome on entry (or at later
times). As discussed above, the predominance of vertical
gene genesis events allows a simplified model wherein
the probability of a new node being regulatory is
determined solely by the number of available links present at
the time of entry. We assume then that on entry into
the genome each new node $n_k$ can form $2k - 1$ links with
nodes $n_1, \ldots, n_k$, consisting of a single self-regulatory link
from node $n_k$ to itself with probability $l$, $(k - 1)$
regulatory outbound links to the existing nodes each with
equal probability $l$, and, provided that sufficient regulators
already exist, $l(k - 1)$ inbound regulatory links from
some subset of the existing regulators chosen according
to preferential attachment. (For consistency, we can only
add $\approx lk$ distinct regulatory links to node $n_k$ provided
there are at least this many regulators in existence. From
Eq. (2) the average number of regulators $p g_o^2 k^2$ must be
greater than the number of regulatory links $lk$, and this
will be satisfied for $k > l/(p g_o^2) \sim 1$.) As a result, the
total number of heads or tails attached to node $n_k$ on
entry to the genome ranges between $0$ and $k$, with each link
formed with probability $l$. Hence, the respective
probabilities that the initial number of heads $h_{kk} = j$ or the
initial number of tails $t_{kk} = j$ for node $n_k$ is

$$P(j,k) = \binom{k}{j}\, l^{j} (1 - l)^{k-j}, \qquad (5)$$

with the proviso that all the inbound links can only
be added to node $n_k$ if there is a sufficient number
of regulators among the nodes $n_1, \ldots, n_k$. The average
number of inbound and outbound links is identical,
$\langle t_{kk} \rangle = \langle h_{kk} \rangle = lk$, showing linear growth in link
number with increasing network size. The addition of node
$n_k$ and its links will increase the probable number of
heads attached to earlier nodes $n_j$ for $1 \le j \le (k - 1)$ so
$h_{jk} > h_{jj}$, while the probable number of tails outbound
from node $n_j$ increases, $t_{jk} > t_{jj}$, if and only if that node
is regulatory with $t_{jj} > 0$.
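
Reading the entry distribution as binomial, with each of the $k$ potential links formed independently with probability $l$ as the text states, the normalization and the linear mean $lk$ can be checked numerically. The Python sketch below is illustrative only; the function names and the test values of $k$ and $l$ are assumptions.

```python
from math import comb


def entry_link_probability(j, k, l):
    """P(j, k): chance that node n_k enters with exactly j heads (or tails)."""
    return comb(k, j) * l**j * (1 - l) ** (k - j)


def mean_entry_links(k, l):
    """Average number of heads (or tails) on entry; should equal l * k."""
    return sum(j * entry_link_probability(j, k, l) for j in range(k + 1))


if __name__ == "__main__":
    k, l = 50, 0.01  # illustrative values
    print(sum(entry_link_probability(j, k, l) for j in range(k + 1)))
    print(mean_entry_links(k, l))
```

The chance that a node enters with no tails at all is $P(0,k) = (1-l)^k$, which is the quantity used next to count regulators.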

As regulators can only be created on entry to the
genome, the distribution of regulators at any time is
specified by the distribution $P(j,k)$ for $t_{kk}$. Using Eq. (5), the
probability that node $n_k$ is a regulator is $1 - P(0,k)$, so
for a network of $N$ nodes, the predicted total number of
regulators is $R = \sum_{k=1}^{N} [1 - P(0,k)]$, that is,

$$\begin{aligned}
R &= N - \frac{(1 - l)\left[1 - (1 - l)^{N}\right]}{l} \\
  &\approx \frac{l}{2} N (N + 1) \\
  &\approx \frac{l}{2} N^{2}.
\end{aligned} \qquad (6)$$

The exact top line shows the expected behaviour for the
number of regulators in the respective limits $l \to 0$ giving
$R \to 0$, and $l \to 1$ giving $R \to N$. The approximate
relation in the third line can be compared to the observed Eq.
(3) and immediately suggests $l \approx 2 p g_o^2$, while a fit to the
more accurate top line gives the connection probability



$$l = 1.15 \times 2 p g_o^2 = 7.31 \times 10^{-5}. \qquad (7)$$

This probability value suggests an average promoter
binding site length of $-\log_4 l = 6.9$ bases. The average
number of links per regulator using the second line
of Eq. (6) is then approximately $L/R \approx 2$, while the more
accurate top line with $N = 2528$ operons for E. coli [31]
gives $L/R = 2.12$, about a factor of two from the
observed value of 5 for E. coli [32].
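
These figures can be reproduced with a short Python sketch. It sums the regulator probability $1 - (1-l)^k$ directly over all nodes rather than relying on a closed form, and takes $L = lN^2$ from Eq. (4); the constant and function names are illustrative.

```python
from math import log

P = 1.10e-5                     # quadratic coefficient p from Eq. (2)
G_O = 1.70                      # assumed genes per operon
L_PROB = 1.15 * 2 * P * G_O**2  # connection probability l of Eq. (7)


def expected_regulators(n_operons, l):
    """R = sum over nodes of the chance 1 - (1 - l)**k of having a tail on entry."""
    return sum(1 - (1 - l) ** k for k in range(1, n_operons + 1))


def total_links(n_operons, l):
    """Eq. (4): L = l * N**2."""
    return l * n_operons**2


if __name__ == "__main__":
    n = 2528  # E. coli operon count quoted in the text
    print(L_PROB)                                                  # about 7.31e-5
    print(-log(L_PROB, 4))                                         # about 6.9 bases
    print(total_links(n, L_PROB) / expected_regulators(n, L_PROB)) # about 2.12
```

The ratio stays close to 2 for any network small enough that $lN \ll 1$, since most regulators then carry a single link.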



A. Random distribution of regulated operons 

The distribution of link heads for all nodes (with
possession of a link head designating a regulated node)
can be straightforwardly calculated under the assumption
that the $t_{kk} \approx lk$ new tails added with node $n_k$
are randomly distributed across the $k$ existing nodes so
on average, each existing node receives $l$ links. To build
insight, it is useful to consider the general case where
$t_{kk} \approx h_{kk} \approx l k^{\alpha}$ for $\alpha \ge 0$. Setting $\alpha = 0$ adds with
some probability a constant number of links with each
new node, $\alpha = 1$ adds a linearly growing number of
probable links with each new node, $\alpha = 2$ adds a quadratically
growing number of probable links with each new node,
and so on. The total number of links present in the
network is then

$$L = \sum_{k=1}^{N} 2 l k^{\alpha} \approx \frac{2 l N^{\alpha+1}}{\alpha + 1}. \qquad (8)$$

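The continuous approximation for $l k^{\alpha}$ links randomly
distributed over $k$ existing nodes determines the
number of inbound head links for node $n_j$ according to

$$\frac{\partial h_{jk}}{\partial k} = \frac{l k^{\alpha}}{k}. \qquad (9)$$

This can be integrated with initial conditions $h_{jj} \approx l j^{\alpha}$
at time $j$ and final conditions $h_{NN} \approx l N^{\alpha}$ at time $N$ to
give

$$h_{jN} = \begin{cases}
l + l \ln \dfrac{N}{j} & \text{if } \alpha = 0 \\[2ex]
\dfrac{l}{\alpha} N^{\alpha} + l \left(1 - \dfrac{1}{\alpha}\right) j^{\alpha} & \text{if } \alpha > 0.
\end{cases} \qquad (10)$$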



Integration of these link numbers over all node numbers
$j$ gives the required total number of links as in Eq. (8).
For $0 \le \alpha < 1$, the number of links per node is
monotonically decreasing with node number. However, for $\alpha = 1$
and only in this case, the final distribution is independent
of node number $j$ because earlier nodes receive exactly
enough links from later nodes to balance the initially
biased distribution of heads $h_{jj} \approx lj$, so in the end, all
nodes receive on average the same number of inbound
regulatory links $\langle h_{jN} \rangle = lN$ for $1 \le j \le N$. For faster
acceleration rates, $\alpha > 1$, the number of links per node is
monotonically increasing as later nodes receive a greater
number of links on entry to the genome and this
imbalance is not corrected.
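
The three regimes can be checked with a small Python sketch. It assumes the continuous rate equation $\partial h_{jk}/\partial k = l k^{\alpha - 1}$ with initial condition $h_{jj} = l j^{\alpha}$, which is our reading of the continuous approximation described above; the function name and the numerical values are illustrative.

```python
from math import log


def inbound_links(j, n, l, alpha):
    """Average heads on node j at network size n, from integrating
    dh/dk = l * k**(alpha - 1) starting at h = l * j**alpha when k = j."""
    if alpha == 0:
        return l + l * log(n / j)
    return l * j**alpha + (l / alpha) * (n**alpha - j**alpha)


if __name__ == "__main__":
    n, l = 1000, 1e-4  # illustrative values
    for alpha in (0.5, 1.0, 2.0):
        early = inbound_links(10, n, l, alpha)
        late = inbound_links(900, n, l, alpha)
        print(alpha, early, late)  # decreasing, flat, increasing with node number
```

Only the flat case $\alpha = 1$ yields an age-independent head distribution, which is why the uniform distribution of regulated operons is tied to quadratic growth.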

The possibility of monotonically increasing numbers
of links with node number in accelerating networks has
not previously been considered. This possibility requires
modifying the usual continuum approach [37, 38, 39] so
the final inbound link distribution is obtained via

$$\begin{aligned}
H(k,N) &= \frac{1}{N} \int_0^N dj\, \delta(k - h_{jN}) \\
       &= \pm \frac{1}{N} \left( \frac{\partial h_{jN}}{\partial j} \right)^{-1} [j = j(k,N)],
\end{aligned} \qquad (11)$$

where $j(k,N)$ is the solution of the equation $k = h_{jN}$.
The top line is used when all nodes possess the same
average link number while the second line is applicable
with the plus (negative) sign when the average number
of links per node is monotonically increasing (decreasing)
with node number. Non-monotonic cases require
alternate approaches.

Under quadratic growth in total link number when $\alpha = 1$,
and only in this case, the final distribution of link heads
is independent of node number and evaluated using Eq.
(11) to give

$$H(k,N) = \frac{1}{N} \int_0^N dj\, \delta(k - lN) = \delta(k - lN). \qquad (12)$$


As expected, a compact final link distribution results
when all nodes have an average of $h_{jN} = lN$ inbound
regulatory links at time $N$. This distribution
calculated under the continuous approximation equates to one
where in reality, each node receives a controlling head
with probability $l$ from every other node (though in
practice, the total number of received links is of order unity).
Hence, for any node in a network of size $N$, the actual
probability of having $k$ heads is

$$H(k,N) = \binom{N}{k}\, l^{k} (1 - l)^{N-k}. \qquad (13)$$
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A network simulation with linear growth of link numbers 
per node model (Eq. [SJl serves to vahdate this predicted 
final distribution. Fig. 13 compares the predicted distri- 
bution H{k,N) against observed distributions for typ- 
ical simulated networks of various sizes with negligible 
discrepancies. 
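
As a rough numerical illustration of Eq. (13), the sketch below evaluates the binomial form of $H(k,N)$ and checks that its mean equals $lN$. The value `L_PROB = 1/13677` is an assumption taken from the estimate $N \approx 1/l \approx 13{,}677$ quoted later in the text, and the function names are hypothetical.

```python
from math import comb

# Assumption: link probability l ~ 1/13,677, implied by the text's N ~ 1/l.
L_PROB = 1 / 13677


def head_distribution(k: int, n: int, l: float = L_PROB) -> float:
    """Eq. (13) read as a binomial: probability a node has k inbound links."""
    return comb(n, k) * l**k * (1 - l) ** (n - k)


def mean_heads(n: int, l: float = L_PROB, k_max: int = 40) -> float:
    """Mean inbound links per node; terms beyond k_max are negligible here."""
    return sum(k * head_distribution(k, n, l) for k in range(k_max + 1))


if __name__ == "__main__":
    for n in (2000, 6000, 10000):
        first = [round(head_distribution(k, n), 4) for k in range(5)]
        # The last two columns should agree: numerical mean versus l * N.
        print(n, first, round(mean_heads(n), 4), round(L_PROB * n, 4))
```

The truncation at `k_max = 40` is a convenience for the sparse regime considered here, not part of the model.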

For a network of size $N$, the probability that any given operon is unregulated is $H(0,N)$, so the expected number of unregulated operons summed over all $N$ nodes is

$$
O_u = N(1-l)^N. \tag{14}
$$

This determines the number of regulated operons as

$$
O_r = N - O_u = N\left[1 - (1-l)^N\right], \tag{15}
$$

showing the expected behaviour as $l \to 1$, giving $O_r \to N$, and as $l \to 0$, giving $O_r \approx lN^2 = L$, as each of the sparsely distributed links hits a distinct regulated operon. We note that random gene duplication and deletion events will not change the $H(k,N)$ distribution (other than by changing $N$), as all nodes are identically connected on average. The $H(k,N)$ distribution appears in Fig. 2, which shows a uniform (age-independent) distribution of regulated nodes over the genome. This uniformity is only expected for $a = 1$, corresponding to linear growth in link numbers per node and quadratic growth in regulator numbers.



FIG. 3: A comparison of the predicted distribution of inbound link numbers per node $H(k,N)$ (solid lines) against that observed in simulated networks of various sizes (indicated points) with quadratic growth in the total probable number of randomly attached links.

These predictions can be compared to observation. For the E. coli network of size $N = 2528$ operons or 4289 genes [?], the predicted proportion of regulated operons receiving $k > 0$ inputs is

$$
P_h(k) = \frac{H(k,N)}{1 - H(0,N)} \tag{16}
$$

and is shown in Fig. 4. Here, the calculated distribution closely approximates the compact exponential distribution observed for E. coli shown in Fig. 2(d) of Ref. [31] and in Fig. 5 of Ref. [34], though it underestimates the numbers of regulated operons with 4, 5, 6 and 7 inputs: essentially no operons are predicted to have 5 or more inputs for genomes of size $N = 2528$ operons. In addition, the average number of inbound regulatory links per operon (for all operons) is $\langle k \rangle = L/N = lN = 0.19$, while the average number of inbound regulatory links for regulated operons is $\langle k_r \rangle = L/O_r \approx 1$. A more accurate calculation using the specific values for E. coli gives $\langle k_r \rangle = L/O_r = 1.10$, reasonably close to the E. coli value of 1.5 or 1.6 noted in Refs. [?].

FIG. 4: The predicted proportions $P_h(k)$ of the regulated operons of E. coli taking multiple regulatory inputs for a genome of $N = 2528$ operons. This distribution closely approximates that observed for E. coli in Fig. 2(d) of Ref. [31] and in Fig. 5 of Ref. [34].
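
The E. coli figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes $l = 1/13677$ (from the later estimate $N \approx 1/l$), uses $L \approx lN^2$ for the total link number, and reads Eqs. (13), (15) and (16) in their binomial form. All function names are hypothetical.

```python
from math import comb

L_PROB = 1 / 13677   # assumed link probability l
N_ECOLI = 2528       # operon count quoted in the text


def total_links(n: int, l: float = L_PROB) -> float:
    """Total probable link number, L ~ l N^2, as used in the text."""
    return l * n**2


def regulated_operons(n: int, l: float = L_PROB) -> float:
    """Eq. (15): expected number of operons with at least one input."""
    return n * (1 - (1 - l) ** n)


def input_proportion(k: int, n: int, l: float = L_PROB) -> float:
    """Eq. (16): share of regulated operons that take exactly k inputs."""
    def heads(m: int) -> float:
        return comb(n, m) * l**m * (1 - l) ** (n - m)
    return heads(k) / (1 - heads(0))


if __name__ == "__main__":
    links = total_links(N_ECOLI)
    print("mean inputs, all operons:", round(links / N_ECOLI, 3))
    print("mean inputs, regulated operons:",
          round(links / regulated_operons(N_ECOLI), 3))
    for k in range(1, 6):
        print(k, round(input_proportion(k, N_ECOLI), 5))
```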



B. Scale-free distribution of regulator operons

On entry into the genome, node $n_k$ sources on average $lk$ outbound regulatory links, and this linear growth in link number means that more recent nodes are more likely to be immediately regulatory and more likely to be highly connected on genome entry. However, node $n_k$ will also receive on average $lk$ inbound regulatory links whose tails will be preferentially attached to existing regulators. The final distribution of link number with age will depend on the rate at which earlier nodes under preferential attachment can attract links relative to the linearly increasing link numbers of later regulatory nodes.

On entry at time $k$, node $n_k$ receives $h_{kk} \approx lk$ inbound links from the existing regulatory nodes. As previously, we gain insight by considering the general case where $t_{kk} \approx h_{kk} \approx lk^a$ for $a > 0$ (though we continue to use the distribution $P(j,k)$ of Eq. (??) to determine both the number of links $j$ prior to exponentiation and the regulatory probability, so that the number of regulators continues to increase quadratically according to Eq. (??)). As a result, the need to ensure that all regulatory links to node $n_k$ are distinct requires that the new link number $lk^a$ be less than the number of existing regulators, of order $lk^2$, which requires $a < 2$. The $h_{kk}$ new tails added with node $n_k$ are preferentially attached to the existing regulatory nodes $n_j$ with probability proportional to the number of existing regulatory links for that node at time $k$, i.e. $t_{jk}$. Using the continuous approximation [37, 38, 39], the rate of growth in outbound link number for node $n_j$ is then approximately

$$
\frac{\partial t_{jk}}{\partial k} = h_{kk}\,\frac{t_{jk}}{\int_0^k t_{jk}\,dj}. \tag{17}
$$



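The denominator here is a probability weighting to ensure normalization and is the total number of outbound links for all nodes. Following Ref. [?], we can evaluate the denominator using the identity

$$
\frac{\partial}{\partial k}\int_0^k t_{jk}\,dj = t_{kk} + \int_0^k \frac{\partial t_{jk}}{\partial k}\,dj. \tag{18}
$$

This can be evaluated using Eq. (17), noting $t_{kk} \approx h_{kk} \approx lk^a$, giving

$$
\frac{\partial}{\partial k}\int_0^k t_{jk}\,dj \approx 2lk^a, \tag{19}
$$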



which can be integrated, determining the denominator of Eq. (17) to be

$$
\int_0^k t_{jk}\,dj = \frac{2l}{a+1}\,k^{a+1}. \tag{20}
$$

This is in agreement with Eq. (??). Substituting this value into Eq. (17) gives

$$
\frac{\partial t_{jk}}{\partial k} = \frac{a+1}{2}\,\frac{t_{jk}}{k}. \tag{21}
$$



Finally, this can be integrated with initial conditions $t_{jj} \approx lj^a$ at time $j$ and final conditions $t_{jN}$ at time $N$ to give

$$
t_{jN} = lj^a\left(\frac{N}{j}\right)^{\frac{a+1}{2}}. \tag{22}
$$

Again we find that the respective choices $a < 1$ and $a > 1$ lead to monotonically decreasing and increasing numbers of links per node as a function of node number, while setting $a = 1$ ensures the number of links per node is independent of node number. In this case, the preferential attachment of links to earlier nodes does indeed act to cancel the initial bias in link number towards later nodes. It is also apparent that when $a = 1$, the limit $l \to 1$ implies all nodes possess exactly $N$ links, as expected for a fully connected regular network. (Preferential attachment cannot distort connectivity numbers in this case as all nodes have an equal number of links.) Additionally, in the limit $l \to 0$ we have $t_{jN} = 0$, as required for an entirely disconnected network. The case $a = 0$ duplicates results found for growing networks which add a constant number of links with each new node subject to preferential attachment [?].
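
The three regimes described for Eq. (22) can be checked directly. The sketch below assumes the closed form $t_{jN} = l j^a (N/j)^{(a+1)/2}$, which follows from integrating Eq. (21) from $t_{jj} \approx lj^a$; the function name and the sample values of $l$, $N$ and $a$ are illustrative assumptions.

```python
def tails_at_time_n(j: float, n: float, l: float, a: float) -> float:
    """Assumed closed form of Eq. (22): t_jN = l * j^a * (N/j)^((a+1)/2)."""
    return l * j**a * (n / j) ** ((a + 1) / 2)


def is_monotone(values, increasing: bool) -> bool:
    """True if consecutive values strictly increase (or strictly decrease)."""
    pairs = zip(values, values[1:])
    return all((y > x) if increasing else (y < x) for x, y in pairs)


if __name__ == "__main__":
    l, n = 1 / 13677, 2528        # illustrative values only
    nodes = [10, 100, 500, 1000, 2000, 2528]
    for a in (0.5, 1.0, 1.5):
        profile = [tails_at_time_n(j, n, l, a) for j in nodes]
        # a < 1 falls with j, a = 1 is flat at l * N, a > 1 rises with j.
        print(a, [round(t, 4) for t in profile])
```

For $a = 1$ every entry reduces to $lN$, which is the node-independent average discussed in the text.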



As previously, it is straightforward to calculate the final outbound link distribution in the case $a = 1$ using Eq. (11). This gives

$$
T(k,N) = \frac{1}{N}\int_0^N dj\,\delta(k - lN) = \delta(k - lN). \tag{23}
$$

Again, we find the expected compact distribution resulting when all nodes possess the same average number of links. This raises the question, however, of how a probabilistic accelerating network subject to preferential attachment can end up with all nodes possessing the same average number of links. The answer lies in our use of two classes of distinguishable nodes, regulators and non-regulators, which requires that we take into account the known distribution of regulators with node number over the genome. The average link number per node at node $n_j$ (Eq. (22)) equates to the product of the average number of link tails per regulator at node $n_j$, denoted $t_r(j,N)$, and the average number of regulators per node at node $n_j$, denoted $\rho(j)$. This latter density is $\rho(j) = dR(j)/dj \approx lj$ by Eq. (??), so by definition we have

$$
t_{jN} = t_r(j,N)\,\rho(j), \tag{24}
$$

giving

$$
t_r(j,N) = N^{\frac{a+1}{2}}\, j^{\frac{a-3}{2}}. \tag{25}
$$

Hence, for $a < 3$, the average number of links per regulator is a decreasing function of node number $j$, as the growing number of links added to recent nodes is insufficient to outweigh the effects of preferential attachment, which more rapidly increases the number of links attached to early nodes. In particular, for $a = 1$ with the addition of a linearly increasing number of links per node, the average number of regulatory links per regulator scales inversely with node number $j$. In other words, the density of regulators is very low at small node numbers $j$, while the very few regulatory nodes in this stretch of the genome are heavily connected due to preferential attachment so as to maintain the constant average of Eq. (22). (See Fig. 2.)

The $t_r(j,N)$ distribution contains information about both node connectivity and node age and so approximates genome statistics (simulated or observed) when all of this information is available. However, it is usually the case that node age information is unavailable, necessitating calculation of connectivity distributions that are not conditioned on node age. This effectively requires binning together all nodes irrespective of their age to obtain a final link distribution. In the case of a linearly growing number of links per node, $a = 1$, the delta function of Eq. (11) is resolved by the equality $j = N/k$, giving the final distribution as

$$
T(k,N) = \frac{1}{N}\left|\frac{dj}{dk}\right| = \frac{1}{k^2}, \tag{26}
$$

which, as required, is normalized as $\int_1^\infty T(k,N)\,dk = 1$. The expected proportion of regulators $P_t(k)$ possessing $k$ links is then obtained by integrating the continuous distribution of Eq. (26) over the appropriate ranges $[1, 3/2]$ or $[k - 1/2, k + 1/2]$ to obtain

$$
P_t(k) =
\begin{cases}
\dfrac{1}{3}, & k = 1, \\[2ex]
\dfrac{4}{4k^2 - 1}, & k > 1.
\end{cases} \tag{27}
$$

These theoretical predictions compare well to simulations of networks of various sizes with linearly increasing numbers of probable links per node and subject to preferential attachment. Fig. 5 shows simulated outbound link distributions which are long-tailed and scale free, with probabilities scaling roughly as $P_t(k) \propto k^{-2}$ for large $k$. The $P_t(k)$ distribution shows that a full one third of regulators have only one link, while 60% have two or fewer links and 71% have three or fewer links. Fig. 6 shows the long-tailed distribution $P_t(k)$ expected for a simulated E. coli network of $N = 2528$ operons with preferential attachment of links. This figure shows marked similarities to the long-tailed distribution of E. coli shown in Fig. 2(c) of Ref. [31]. In particular, the expected number of regulators with $k$ links is $P_t(k)R(N)$, with the number of regulators $R(N)$ obtained from Eq. (??) (or from observation). For E. coli, this predicts the probable existence of about one regulator possessing link numbers in each of the respective ranges [40, 49], [50, 64], [65, 94], [95, 169], and [170, 700] links, for instance. (This approximates the connectivity of the global food sensor CRP, which regulates up to 197 genes directly [?].) The average of the $P_t(k)$ distribution (as well as of the $t_r(j,N)$ distribution) is formally undefined as long as the integration limits are taken to infinity. However, in a network of $N$ nodes, a regulator can in practice regulate at most $N$ nodes, and this cutoff allows us to estimate the average connectivity per regulator (complementing previous estimates following Eq. (7)). Using the cutoff and approximating the summation via an integral, the average connectivity per regulator in a network of $N$ nodes is

$$
\langle k \rangle = \sum_{k=1}^{N} k P_t(k) \approx \frac{1}{3} + \frac{1}{2}\ln\left(\frac{4N^2 - 1}{15}\right) \tag{28}
$$

(or simply $\ln N$ using the continuous distribution of Eq. (26)). The average number of links per regulator for E. coli from Eq. (28) is $\langle k \rangle = 7.51$ (or 7.83 using the simpler derivation), which again compares well to the observed value of 5 in E. coli [32].
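
The cumulative fractions and the average connectivity quoted above follow from Eqs. (27) and (28) by direct evaluation. The sketch below takes $P_t(1) = 1/3$ and $P_t(k) = 4/(4k^2 - 1)$ for $k > 1$, which is what integrating $1/k^2$ over the stated ranges gives; the function names are hypothetical.

```python
from math import log


def tail_proportion(k: int) -> float:
    """Eq. (27): proportion of regulators with exactly k outbound links."""
    return 1 / 3 if k == 1 else 4 / (4 * k * k - 1)


def mean_links_per_regulator(n: int) -> float:
    """Eq. (28): average connectivity with the sum cut off at k = n."""
    return 1 / 3 + 0.5 * log((4 * n * n - 1) / 15)


if __name__ == "__main__":
    running = 0.0
    for k in range(1, 4):
        running += tail_proportion(k)
        print(f"regulators with at most {k} links: {running:.3f}")
    n = 2528  # E. coli operon count from the text
    print("Eq. (28) estimate:", round(mean_links_per_regulator(n), 2))
    print("ln N estimate:", round(log(n), 2))
```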



IV. INHERENT PROKARYOTE SIZE LIMITS 

The accelerating nature of regulatory gene networks necessarily means that these networks must either exhibit a transition at some critical network size to a nonaccelerating architecture permitting continued growth or cease growth entirely, and we now seek to predict the location of this transition point and compare it to the evolutionary record. We begin by examining an overview of the accelerating genome model. Fig. 7 shows that linear growth in link numbers per node ($a = 1$) allows a quadratic growth in the total number of links (Eq. (??)) despite each of the number of regulators (Eq. (??)) and the number of regulated nodes (Eq. (15)) asymptoting to some fraction of $N$ after an initial period of quadratic growth. For large genomes, almost all new nodes will be regulators and densely connected into the existing network, which will then multiply regulate almost every node.

FIG. 5: A simulation of the proportion of outbound links per regulator $P_t(k)$ in networks of various sizes with linear growth in the probable number of links per node preferentially attached to regulatory nodes. The log-log plot shows slopes of roughly $-2$, in agreement with theoretical predictions (heavy solid line) of a long-tailed scale-free distribution $P_t(k) \propto k^{-2}$.

FIG. 6: The predicted proportion of regulatory operons $P_t(k)$ regulating $k$ different operons for a simulated E. coli genome with $N = 2528$ operons. As expected, most regulators regulate only one other operon, though a small number of regulators can regulate more than 40 operons. This distribution closely approximates the observed proportions for E. coli in Fig. 2(c) of Ref. [31] and Fig. 4 of Ref. [34], and predicts the probable existence of about one E. coli regulator possessing link numbers in each of the respective ranges [40, 49], [50, 64], [65, 94], [95, 169], and [170, 700] links, and so on.

FIG. 7: The quadratic growth in the number of regulatory links $L$, and the asymptoting quadratic growth of regulatory operons $R$ and of regulated operons $O_r$ in relation to the total number of operons $N$. Actual numbers of regulators for 89 prokaryote genomes are shown (solid points), while the non-asymptoting fitted quadratic curve $R_q$ is shown for comparison. The observed maximum size of prokaryote genomes (of order 10,000 genes or about 6,000 operons) lies near the transition point between sparse and dense connectivity as an increasing proportion of operons become linked into the regulatory network.

The transition from sparse to dense connectivity occurs as an increasing proportion of operons become linked into the regulatory network, leading to the emergence of a single giant component of fully connected nodes. One way to highlight this transition is by determining the proportion of transcription factors which control downstream regulators, as such linkages create the single giant component. The proportion of regulators controlling regulators is

$$
P_{rr} = \frac{1}{R(N)}\sum_{k=1}^{N}\left[1 - (1-l)^k\right]\frac{N}{k}\,\frac{R(N)}{N}
       \approx lN. \tag{29}
$$

Here, the first fraction on the RHS normalizes the proportion in terms of the number of regulators $R(N)$ (Eq. (??)), the first term in the summation is the probability that node $n_k$ is a regulator, the second term is the average number of regulatory outbound links for this regulatory node, $t_r(k,N)$, at network size $N$ (Eq. (25) with $a = 1$), and the third term approximates the probability that these links attach to one of the existing regulators under random attachment. (If the very first and very last terms are dropped, the remaining summation over all nodes of the probability that $n_k$ is regulatory with the stated number of links equates to the total number of links in the network, $L \approx lN^2$. This is the more accurate version of the calculation leading to Eq. (??).) Hence, the proportion of regulators which control transcription factors scales linearly with network size and equals 15% for an $N = 2000$ network, 29% for $N = 4000$, 44% for $N = 6000$, 59% for $N = 8000$, 73% for $N = 10000$, and 88% for $N = 12000$ operons (after which the approximations made break down). Naturally, when most regulators themselves control other regulators, the entire regulatory network will consist of a single giant component. These ratios compare reasonably well with those observed in E. coli, where Ref. [?] noted that of 121 transcription factors for which one or more regulatory genes are known, 38 factors or 31.4% regulate other transcription factors. The approximate second line of Eq. (29) with $N = 2528$ for E. coli determines this proportion as $P_{rr} = 18.5\%$, while the more accurate top line gives the proportion of regulators which control transcription factors as $P_{rr} = 17.7\%$, giving a reasonable match between prediction and observation.
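
The two estimates of $P_{rr}$ can be compared numerically. The sketch below assumes that the top line of Eq. (29) reduces, after the $R(N)$ factors cancel, to $\sum_{k=1}^{N} [1 - (1-l)^k]/k$, a reading based on the term-by-term description above, and that $l = 1/13677$. Function names are hypothetical.

```python
L_PROB = 1 / 13677  # assumed link probability l


def p_rr_sum(n: int, l: float = L_PROB) -> float:
    """Top line of Eq. (29) as read here: sum over k of [1 - (1-l)^k] / k."""
    return sum((1 - (1 - l) ** k) / k for k in range(1, n + 1))


def p_rr_linear(n: int, l: float = L_PROB) -> float:
    """Second line of Eq. (29): the linear approximation l * N."""
    return l * n


if __name__ == "__main__":
    for n in (2000, 2528, 4000, 6000, 8000, 10000, 12000):
        print(n, f"{100 * p_rr_linear(n):.1f}%", f"{100 * p_rr_sum(n):.1f}%")
```

The gap between the two columns widens with $N$, which is one way to see where the linear approximation starts to break down.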

As the proportion of regulators of transcription factors rises, the probable length of regulatory cascades will increase. In fact, the proportion of regulators taking part in a regulatory cascade of length $n \ge 1$ is

$$
P_n = (1 - P_{rr})\,P_{rr}^{\,n-1}. \tag{30}
$$

This equation can be obtained from a tree of all binary pathways which at each branching point either terminate with probability $(1 - P_{rr})$ or cascade with probability $P_{rr}$. As such, the probable cascade length is negligible when the proportion of regulators controlling regulators is small, $P_{rr} \ll 1$, but can become large as $P_{rr}$ itself increases. As $P_{rr}$ is indeed large for networks of size $N > 6000$, this again suggests that long cascades of regulatory interactions will lead to the coalescing of a single giant component in this regime. Again, the calculated lengths of regulatory cascades can be compared to those in E. coli, where the number of cascades of regulated transcription factors observed in a particular set of regulatory interactions was 23 two-level cascades or 37.7%, 32 three-level cascades or 52.5%, and 6 four-level cascades or 9.8% [23]. As one-level or autoregulatory interactions are not included in this observation, the predicted proportions for E. coli are $p_n = P_n/(1 - P_1)$, with $P_{rr} = 17.7\%$ giving 82% two-level cascades, 15% three-level cascades, 3% four-level cascades, 1% five-level cascades, and so on. It is seen that the theoretical predictions overestimate the proportion of two-level cascades and underestimate the number of three-level and higher cascades, probably because of selection pressures not included in the model. Lastly, we note that the number of cycles involving closed regulatory loops of size greater than one (i.e. involving more than autoregulation) in the examined portion of the E. coli regulatory network is zero, reflecting that feedback loops in these organisms are carried out at the post-transcriptional level involving metabolites, such as appear in the lac operon [32, 33, 34].
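
The cascade proportions quoted above can be recomputed from Eq. (30) in its geometric form $P_n = (1 - P_{rr})P_{rr}^{n-1}$, renormalized to exclude one-level cascades. This is a minimal sketch with hypothetical function names, using the text's value $P_{rr} = 17.7\%$.

```python
def cascade_proportion(n: int, p_rr: float) -> float:
    """Eq. (30): proportion of regulators in a cascade of length n >= 1."""
    return (1 - p_rr) * p_rr ** (n - 1)


def multilevel_proportion(n: int, p_rr: float) -> float:
    """p_n = P_n / (1 - P_1): share among cascades of length two or more."""
    return cascade_proportion(n, p_rr) / (1 - cascade_proportion(1, p_rr))


if __name__ == "__main__":
    p_rr = 0.177  # E. coli value from the text
    for n in range(2, 6):
        print(f"{n}-level cascades: {100 * multilevel_proportion(n, p_rr):.1f}%")
```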

We note that our model is entirely unable to explain the high proportion of autoregulation observed in E. coli, with various estimates that 28.1% [?], 50% [?] and 46.9% [33] of regulators are autoregulatory. The predicted proportion of autoregulators is approximated by replacing the very last fraction $(R/N)$ in Eq. (29) by the term $1/N$, giving the probability that a self-directed link is formed, leading to the expected autoregulatory proportion $\approx 2/N \approx 0.08\%$ for E. coli. This failure likely reflects the action of selection processes promoting spatial rearrangements of entire regulons on the genome and the internal shuffling of genes and promotor units. Such reorganizations of duplicated gene regions (presumably shuffling genes and promotor regions) have been common in E. coli, allowing for instance spatial regulatory motifs whereby the promotors of colocated (overlapping) and often co-functional operons transcribed in opposing directions can interfere [41].

The transition point from sparse to dense connectivity can be roughly located using the continuum approximation [37, 38, 39]. These methods have not previously been used for this purpose (to our knowledge), and we first validate their use by deriving the known result that non-growing random graphs of $N$ nodes connected by an increasing number $L$ of undirected links undergo a phase transition from sparse to dense connectivity when $L = N/2$ [?]. As the number of links $L$ grows, the $N$ nodes are interlinked firstly into separate islands of size $s_i$ nodes for $i = 1, 2, \ldots$, which eventually link up to form a giant component, designated $s_1$, containing essentially all nodes, $s_1 \approx N$. The largest component grows whenever a newly added link has either its head or tail in island $s_1$ (with probability $s_1/N$) and the other outside it (with probability $(N - s_1)/N$), leading to a size increment equal to the average size of the external islands $\langle s_{j \neq 1} \rangle$, giving

$$
\frac{ds_1}{dL} = 2\,\frac{s_1}{N}\,\frac{N - s_1}{N}\,\langle s_{j \neq 1} \rangle. \tag{31}
$$

Numerical or analytic integration of this equation with initial conditions $s_1 = 2$ when $L = 1$, and assuming the average size of the external smaller islands is $s_1/2$, shows the largest island saturating the entire network when $L \approx N/2$, as expected. (This simple approach is indicative only and is quite sensitive to, for instance, the assumed average size of external islands.)
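
The saturation at $L \approx N/2$ can be checked in closed form. The sketch below assumes Eq. (31) has the form $ds_1/dL = 2 (s_1/N)((N - s_1)/N) \langle s_{j \neq 1} \rangle$, a form inferred from the verbal description rather than confirmed by the source, together with $\langle s_{j \neq 1} \rangle = s_1/2$ and $s_1 = 2$ at $L = 1$. Separating variables then gives $L$ as an explicit function of $s_1$. The function name is hypothetical.

```python
from math import log


def links_needed(s: float, n: float) -> float:
    """Link number L at which the largest island reaches size s.

    Assumes ds/dL = s^2 (N - s) / N^2 with s = 2 at L = 1, whose integral is
    L = 1 + [ln(s / (N - s)) - N / s] evaluated between 2 and s.
    """
    def antiderivative(x: float) -> float:
        return log(x / (n - x)) - n / x
    return 1 + antiderivative(s) - antiderivative(2.0)


if __name__ == "__main__":
    n = 10000
    for share in (0.01, 0.1, 0.5, 0.9, 0.99):
        # L/N stays close to 1/2 over almost the whole growth of the island.
        print(share, round(links_needed(share * n, n) / n, 4))
```

Under these assumptions the largest island grows from about 1% to 99% of the network while $L/N$ moves only slightly above one half, which is the sharp transition the text refers to.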

This result suggests the following transition point in directed regulatory gene networks. Each undirected (i.e. bidirectional) link in random graph theory is equivalent to two directed links allowing bidirectional traffic between any two nodes, suggesting a transition point in directed graphs at roughly $L = N$. This analysis suggests that the largest component is expected to saturate the entire network when link number $L \approx N$, or $N \approx 1/l \approx 13{,}677$ (see Fig. 7). In turn, this suggests that for $N < 13{,}677$ a typical network likely consists of isolated trees, while if $N > 13{,}677$ the network likely consists of a single giant cluster where almost every node is connected to all others via intermediate links. When the link number is very large, $N \gg 13{,}677$, the network becomes regularly connected [?]. As prokaryote regulatory networks likely consist of functionally distinct regulated modules [33, 43], it is unlikely that prokaryotic gene networks can successfully operate in the fully connected regime, suggesting that prokaryote genome sizes are constrained to $N < 13{,}677$. In fact, the previously noted absence of regulatory cycles in E. coli [32, 33, 34] likely reflects the evolutionary importance of maintaining disjoint and non-interfering regulatory units.

These results of random graph theory are suggestive only, and we now turn to consider the size of the largest connected island in prokaryote gene networks featuring directed links whose tails are preferentially attached to regulators and whose heads are randomly distributed over all existing nodes. A further difference is that prokaryote regulatory networks are themselves growing, with each added node accompanied by a probabilistic number of links. In addition, we define an island to consist of all nodes which are linked regardless of the orientation of all links, and so effectively treat links as being undirected. This is because a regulator can potentially perturb every node downstream of it, including those nodes downstream of other regulators, and so can modify the regulatory effects of other regulators: essentially, if the downstream effects of different regulators eventually intersect, we count these regulators in the same island. (Other definitions of islands could be used.)

FIG. 8: The total number of discrete disconnected islands $i_{\rm all}$, the number of islands with respectively two ($i_2$), three ($i_3$) and four ($i_4$) members (left hand axis), and the simulated ($\langle s_1 \rangle$) and predicted ($s_1$) size of the largest island measured as a proportion of nodes for various genome sizes (right hand axis). The simulations show the largest island contains $\langle s_1 \rangle = 50\%$ of all nodes at a critical network size of $N_c = 9{,}029$ nodes. The input parameters of the predicted curve $s_1$ are set so that $s_1 = \langle s_1 \rangle$ at this point.

The dominant (but not sole) mechanism by which island $s_1$ can grow is for the newly added node $n_k$ to either (a) be a regulator (with probability $[1 - (1-l)^k]$) and establish an outbound regulatory link to some existing node in $s_1$ (with probability $s_1/k$) while at the same time accepting a regulatory link (with probability $[1 - (1-l)^k]$) from a node in a different island $s_{j \neq 1}$ (with probability $(k - s_1)/k$), or (b) accept an inbound regulatory link (with probability $[1 - (1-l)^k]$) from a regulator in island $s_1$ (with probability $s_1/k$) while establishing a regulatory link (with probability $[1 - (1-l)^k]$) to some node in a different island $s_{j \neq 1}$ (with probability $(k - s_1)/k$). (Here, we assume that regulators are uniformly distributed over islands and that the number of links within an island scales with the size of the island, to crudely model preferential attachment.) The result is that island $s_1$ grows by the size of the second island, assumed to be $\langle s_{j \neq 1} \rangle$. Altogether, the rate of growth in the size of island $s_1$ is then

$$
\frac{ds_1}{dk} = 2\left[1 - (1-l)^k\right]^2 \frac{s_1\,[k - s_1]}{k^2}\,\langle s_{j \neq 1} \rangle. \tag{32}
$$



For initial conditions, we assume that a first link appears when the genome has 177 nodes ($s_1(177) = 2$). Simulations show that sufficient small islands are created to ensure $\langle s_{j \neq 1} \rangle$ remains roughly constant and equal to $\langle s_{j \neq 1} \rangle = 2.72$, though matching the simulated and predicted curves at the 50% point requires setting $\langle s_{j \neq 1} \rangle = 30$. This is reasonable given the approximations made. Fig. 8 shows the size of the largest island $s_1$ as a proportion of all nodes. A single giant component is expected to form at a critical genome size of $N_c = 9{,}029$ operons, defined as the point where the simulated proportion of nodes in the giant component is 50%. (Choosing a parameter setting of 40% would also be justifiable and would lead to an exact match between predicted and observed maxima.) Unlike random graph theory, this critical point applies to all growing genomes, as it is determined by the value of the link formation probability $l$. Genomes of smaller size than this critical value, $N < N_c$, are expected to be sparsely connected so that the network consists of multiple discrete connected islands (as in E. coli [33]), while genomes of larger size, $N > N_c$, are expected to be densely connected into a single giant component where every regulator eventually perturbs the downstream effects of every other regulator.
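
A rough numerical integration of the island growth law illustrates how the 50% point is located. The sketch below is built on several assumptions: Eq. (32) is read as $ds_1/dk = 2[1 - (1-l)^k]^2\, s_1 (k - s_1)/k^2\, \langle s_{j \neq 1} \rangle$, the link probability is $l = 1/13677$, the external island size is held at the fitted value 30, and the integration is a plain Euler scheme with one step per added node starting from $s_1(177) = 2$. All names are hypothetical.

```python
L_PROB = 1 / 13677  # assumed link probability l


def island_fractions(n_max: int, l: float = L_PROB, mean_other: float = 30.0,
                     k0: int = 177, s0: float = 2.0) -> dict:
    """Euler-integrate the assumed form of Eq. (32); return {k: s1 / k}."""
    s = s0
    fractions = {}
    for k in range(k0, n_max + 1):
        fractions[k] = s / k
        p_link = 1 - (1 - l) ** k   # probability node k sends or accepts a link
        s += 2 * p_link**2 * s * (k - s) / k**2 * mean_other
    return fractions


def critical_size(threshold: float = 0.5, n_max: int = 20000, **kwargs):
    """First genome size at which the largest island holds `threshold` of nodes."""
    for k, share in island_fractions(n_max, **kwargs).items():
        if share >= threshold:
            return k
    return None


if __name__ == "__main__":
    print("50% point:", critical_size(0.5))
    print("40% point:", critical_size(0.4))
```

If the reconstruction of Eq. (32) is right, the 50% crossing should land in the neighbourhood of the critical size quoted in the text; the exact value depends on the integration scheme and on the fitted island size.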

Simulations of example genomes of various sizes spanning this critical network size confirm the adequacy of the continuum treatment. Fig. 8 shows the number of all discrete islands as well as the number of islands containing two, three and four components. In the vicinity of the critical genome size $N_c = 9{,}029$, the number of discrete interconnected islands begins to decline as the growing number of links connects more and more islands into the single giant component. The size of the simulated giant component as a proportion of genome size is also shown. This figure suggests that the E. coli genome of $N = 2528$ operons should possess a giant component containing about 5% of all nodes (about 100 nodes), which can be compared with the observation that about 70% or 300 operons of the examined regulatory and regulated operons (but not including unregulated and nonregulatory operons) could be loosely grouped into 3-6 "dense overlapping regulons" or DORs, while the remaining operons appeared as disjoint systems, with most containing 1-3 operons but some containing up to 25 operons [?].

The critical network size of $N_c = 9{,}029$ operons or about $N_g = 15{,}349$ genes corresponds to the point where growing regulatory networks exploiting accelerating links can no longer maintain discrete functional units, or islands, of interconnected nodes. Larger genomes are densely connected into a single giant component where, eventually, any regulator can perturb the downstream effects of every other node, so that, for instance, it is unlikely that the discrete network motifs found in the E. coli regulatory network [32] can survive in this regime. This massive increase in perturbative effects immeasurably increases the difficulty of the evolutionary search process, leading to an expectation that the rate of evolutionary change will drastically slow when growing genome sizes reach criticality, $N \approx N_c$. From a biological point of view, it is relatively easy to understand why the critical network size $N_c$ acts as an upper size limit. The accelerating nature of the prokaryote regulation network means that larger networks can add new nodes only by integrating an increasing number of links to gain evolutionary benefits. Of course, the probability of finding $lN$ beneficial links is a rapidly decreasing function of $N$. It is relatively easy to find a beneficial regulator making only of order one link to existing genes (only billions of trials are needed, say), but much harder when the regulator is making an average of five links with existing genes (many trillions of trials are needed). Essentially, the more links that must be beneficially integrated, the longer the evolutionary search task and the slower the rate of evolutionary change.

FIG. 9: The predicted proportion $P_o(k,N)$ of operons with $0, 1, 2, \ldots$ regulatory inputs as a function of network size. Small networks mainly possess unregulated operons, while networks of large size have a significantly reduced number of unregulated operons, with many operons taking large numbers of regulatory inputs.



Many other statistical measures suggest that regulatory mechanisms optimized to perform in a sparsely connected network will not necessarily operate in a densely connected network, since evolution cannot foresee later needs. In particular, the proportion of operons $n_j$ which are regulated by $k$ inputs is, using Eq. (13), given by

$$
P_o(k,N) = \frac{1}{N}\sum_{j=1}^{N} H(k,N) = \binom{N}{k} l^k (1-l)^{N-k}. \tag{33}
$$

This distribution shifts towards larger $k$ with increasing network size and is shown in Fig. 9, making it clear that small networks mainly possess operons which are either entirely unregulated or regulated by only one or a few regulators. In contrast, large networks ($N > N_c$) have only a small proportion of operons which are unregulated, while the majority of operons take one or more regulatory inputs. It is a more difficult evolutionary task to integrate many inputs to achieve a beneficially regulated output, again suggesting that prokaryote regulatory networks featuring accelerating growth in link number are size limited due to their regulatory architecture.

Another way to suggest the strict size limits imposed by the accelerating growth of regulatory links is to consider the probability that the most recently added node $n_N$ in a network of size $N$ immediately becomes regulatory. Using Eq. (??), node $n_N$ is a regulator with probability

$$
P_r(N) = \sum_{j=1}^{N} P(j,N) = 1 - (1-l)^N. \tag{34}
$$

This probability tends to unity as network size increases and, in particular, approaches about 50% when networks consist of $N_c$ operons (see Fig. 10). At about this stage, large networks cannot add a new node without it having a significant probability of modifying the dynamics of existing nodes. This immeasurably increases the difficulty of the evolutionary task and again suggests a maximum size limit to prokaryote gene regulatory networks.
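
The probability in Eq. (34) is simple to evaluate and to invert. The sketch below assumes the closed form $P_r(N) = 1 - (1-l)^N$ with $l = 1/13677$; the function names are hypothetical.

```python
from math import log

L_PROB = 1 / 13677  # assumed link probability l


def regulatory_probability(n: int, l: float = L_PROB) -> float:
    """Eq. (34): probability that the newest node in a size-n network is a regulator."""
    return 1 - (1 - l) ** n


def size_for_probability(p: float, l: float = L_PROB) -> float:
    """Network size at which the newest node is regulatory with probability p."""
    return log(1 - p) / log(1 - l)


if __name__ == "__main__":
    for n in (2528, 6000, 9029, 13677):
        print(n, f"{100 * regulatory_probability(n):.1f}%")
    print("size giving 50%:", round(size_for_probability(0.5)))
```

With these assumptions the probability at $N_c = 9{,}029$ comes out a little under one half, consistent with the "about 50%" wording.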




FIG. 10: The rapidly increasing probability $P_r(N)$ that the most recently added node $n_N$ in a network of size $N$ nodes is immediately regulatory on its appearance in the genome. For network sizes greater than about $N_c \approx 9{,}029$ operons, the probability that a new node is immediately regulatory exceeds about 50%.

If the accelerating regulatory networks of prokaryotes
were able to operate in the densely connected regime, the
evolutionary record might be expected to show prokaryotes
of arbitrarily large genome size with a transition in
connectivity statistics at some critical genome size of
about Nc ~ 9,029. Conversely, should these regulatory
networks be unable to operate in the densely connected
regime, then the evolutionary record should show a maximum
prokaryote genome size of about Nc ~ 9,029 operons or about
Ng = 15,349 genes, close to the observed upper limit.
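
The two quoted limits imply a fixed conversion between operon and gene counts. As a quick arithmetic check, the sketch below computes the genes-per-operon ratio implied by Nc and Ng as quoted above; the variable and function names are hypothetical, and the ratio is inferred here from the two quoted numbers rather than taken from a stated definition.

```python
# Arithmetic check on the two size limits quoted in the text.
N_C_OPERONS = 9029   # critical network size in operons
N_G_GENES = 15349    # corresponding genome size in genes

# Ratio implied by the two quoted values (an inference, not a stated constant).
genes_per_operon = N_G_GENES / N_C_OPERONS
print(f"implied genes per operon: {genes_per_operon:.2f}")


def operons_to_genes(operons: float, ratio: float = genes_per_operon) -> float:
    """Convert an operon count to a gene count using the implied ratio."""
    return operons * ratio


print(f"genes at Nc: {operons_to_genes(N_C_OPERONS):.0f}")
```

The implied ratio is close to 1.7 genes per operon, so the operon and gene limits are two expressions of the same critical size.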



V. CONCLUSION 

In this paper, we generalize models of accelerating net- 
works by including probabilistic links to allow arbitrarily 
rapid acceleration rates leading to structural transitions 
in growing networks sometimes severe enough to strictly 
constrain network size. These structural transitions from 
sparse to dense connectivity are made more difficult by 
any additional steric or logical limitations on combinatoric
control at any given promoter. Such transitions
are in sharp contrast to the stationary statistics and un- 
bounded growth potential of non-accelerating scale free 
and exponential networks. These probabilistic accelerat- 
ing networks were applied to model prokaryote regulatory 
networks which exploit a quadratic growth in the num- 
ber of regulators and regulatory links with genome size 
as established via comparative genomics programs. Our 
models predict a maximum genome size of Nc ~ 9,029 
operons or about Ng = 15,349 genes for prokaryotes,
closely approximating the observed maximum. We further
validated our model by a detailed comparison of predicted
and observed results for E. coli, achieving satisfactory
matches for the number of observed regulators, an average
promoter binding site length of about 7, the long-tailed
distribution of outgoing regulatory links with an average of
between 2.12 and 7.51 (compared to 5), the exponential
distribution of incoming regulatory links with an average of
around 1.10 (compared to 1.5), the proportion of regulators
controlling regulators of around 17.7% (compared to 31.4%),
the probable length of regulatory cascades, and the absence
of regulatory loops. Our approach is unable to explain the
high proportion of autoregulation observed in E. coli [32],
and this failure likely points to selection for genome
reorganizations leading to spatial arrangements of operons
allowing joint regulation [41], which is not included in this
model. Further, this approach does not include selection
pressures ensuring that similarly regulated islands or
modules share common functionality [32], or other
regulatory mechanisms influencing both the transcription and
translation of transcription factors, including micro-RNAs
and other chemical mechanisms and
mediators (see for instance ^3). 

However, the many successes of the accelerating net- 
work model of prokaryote regulatory networks are mean- 
ingless if similar results can be achieved via non- 
accelerating network models. In later work, we will show 
that the two simplest non-accelerating network models
fail to explain either the observed quadratic growth of
regulator number with genome size or the detailed statistics
pertaining to the E. coli genome [27]. In addition,
the simplifying assumption adopted here that gene du- 
plications ensure that operons become regulatory only 
on entry to the genome will be dropped in later work. 
This will develop a more realistic model including sepa- 
rate physical processes for transcription factor binding to 
DNA and for establishing regulatory links with regulated 
operons where links can form at any time. 

This work has wider significance due to the still common
presumption in molecular biology that "What was true for
E. coli would also be true for the elephant", capturing the
notion that the mechanisms operating in prokaryotes are
identical to those operating in complex multicellular
eukaryotes. In this picture, eukaryotes are merely enlarged
prokaryotes. The results of this paper indicate that this is
not possible: the accelerating nature of regulatory networks
necessarily implies that eukaryotes cannot be scaled-up
prokaryotes and that the (likely) accelerating regulatory
networks of eukaryotes must be exploiting novel regulatory
mechanisms. The successful modelling of these mechanisms
will likely
require incorporating computationally complex technologies
|43,|4a,|43 into an accelerating network model, and
this will also be addressed in later work.